library(tidyverse) # for graphing and data cleaning
library(tidymodels) # for modeling
library(themis) # for step functions for unbalanced data
library(doParallel) # for parallel processing
library(stacks) # for stacking models
library(naniar) # for examining missing values (NAs)
library(lubridate) # for date manipulation
library(moderndive) # for King County housing data
library(vip) # for variable importance plots
library(patchwork) # for combining plots nicely
theme_set(theme_minimal()) # Lisa's favorite theme
data("lending_club")
# Data dictionary (as close as I could find): https://www.kaggle.com/wordsforthewise/lending-club/discussion/170691
When you finish the assignment, remove the # from the options chunk at the top, so that messages and warnings aren’t printed. If you are getting errors in your code, add error = TRUE so that the file knits. I would recommend not removing the # until you are completely finished.
From now on, GitHub should be part of your routine when doing assignments. I recommend making it part of your process anytime you are working in R, but I’ll make you show it’s part of your process for assignments.
Task: When you are finished with the assignment, post a link below to the GitHub repo for the assignment. Make sure the link goes to a spot in the repo where I can easily find this assignment. For example, if you have a website with a blog and post the assignment as a blog post, link to the post’s folder in the repo. As an example, I’ve linked to my GitHub stacking material here.
LINK to Cande’s repository: https://github.com/MCTJ1998/ADS_Assignment.git
We’ll be using the lending_club dataset from the modeldata library, which is part of tidymodels. The data dictionary referenced on its help page doesn’t seem to exist anymore, but the one on this Kaggle discussion seems pretty close. It might also help to read a bit about Lending Club before starting in on the exercises.
The outcome we are interested in predicting is Class. According to the dataset’s help page, its values are “either ‘good’ (meaning that the loan was fully paid back or currently on-time) or ‘bad’ (charged off, defaulted, or 21-120 days late)”.
Tasks:
# Number of loans in each outcome class
lending_club %>%
  count(Class)

# Distributions of the numeric variables
lending_club %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(vars(variable),
             scales = "free")

# Cleaning: convert characters to factors, drop three variables,
# and keep only rows with no missing values
lending_mod <- lending_club %>%
  mutate(Class = as.factor(Class)) %>%
  mutate(across(where(is.character), as.factor)) %>%
  select(-funded_amnt,
         -verification_status,
         -annual_inc) %>%
  add_n_miss() %>%
  filter(n_miss_all == 0) %>%
  select(-n_miss_all)

# Distributions of the categorical variables
lending_mod %>%
  select(where(is.factor)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_bar() +
  facet_wrap(vars(variable),
             scales = "free",
             nrow = 2)
Stratify on Class (add strata = Class to the initial_split() function).
set.seed(494) # for reproducibility
# Split the cleaned data (lending_mod), not the raw lending_club data
lending_split <- initial_split(lending_mod,
                               strata = Class,
                               prop = .75)
lending_training <- training(lending_split)
lending_test <- testing(lending_split)
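One way to see what stratifying on Class does is to compare the share of each class in the two pieces of the split. The sketch below is only an illustration: it splits the raw lending_club data again under the hypothetical name check_split (the names check_split, train_props, and test_props are not part of the assignment), and the proportions it produces should come out nearly identical in training and testing.

```r
library(rsample)
library(modeldata)
library(dplyr)

data("lending_club")

set.seed(494)
# check_split is a hypothetical name used only for this illustration
check_split <- initial_split(lending_club, strata = Class, prop = .75)

# Share of each class in the training piece
train_props <- training(check_split) %>%
  count(Class) %>%
  mutate(prop = n / sum(n))

# Share of each class in the testing piece
test_props <- testing(check_split) %>%
  count(Class) %>%
  mutate(prop = n / sum(n))

train_props
test_props
```

Without strata, a rare class like "bad" could end up noticeably over- or under-represented in the test set just by chance.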
Use step_upsample() from the themis library to upsample the “bad” category so that it is 50% of the “good” category. Do this by setting over_ratio = .5.
Use step_downsample() from the themis library to downsample the “good” category so the bads and goods are even; set under_ratio = 1. Make sure to do this step AFTER step_upsample().
Make all numeric variables type numeric with as.numeric() (use step_mutate_at() and the all_numeric() helper, or this will be a lot of code). This step might seem really weird right now, but we’ll want to do this for the model interpretation we’ll do in a later assignment.
Once you have that, use prep(), juice(), and count() to count the number of observations in each class. They should be equal. This dataset will be used in building the model, but the data without up and down sampling will be used in evaluation.
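The arithmetic behind the two sampling ratios can be sketched in base R. The class counts below are hypothetical, chosen only to make the numbers easy to follow, and the rounding is an assumption: the exact row counts themis produces may differ by a row.

```r
# Hypothetical class counts, for illustration only
n_good <- 7000
n_bad  <- 400

over_ratio  <- 0.5
under_ratio <- 1

# Upsampling raises the minority class to about over_ratio * majority count
n_bad_up <- max(n_bad, floor(over_ratio * n_good))

# Downsampling then caps the majority class at under_ratio * the new minority count
n_good_down <- min(n_good, floor(under_ratio * n_bad_up))

c(good = n_good_down, bad = n_bad_up)
```

Doing the upsampling first means fewer "good" rows are thrown away than if the "good" class were downsampled straight to the original number of "bad" rows.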
set.seed(456)
lasso_recipe <- recipe(Class ~ .,
                       data = lending_training) %>%
  # Pre-processing:
  # upsample "bad" until it is 50% of "good"
  step_upsample(Class, over_ratio = 0.5) %>%
  # then downsample "good" so the two classes are even
  step_downsample(Class, under_ratio = 1) %>%
  # make all numeric variables (including integers) type numeric
  step_mutate_at(all_numeric(),
                 fn = ~as.numeric(.)) %>%
  #step_novel(all_nominal_predictors()) %>%
  step_nzv(-all_outcomes()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
lasso_recipe %>%
  prep(lending_training) %>%
  # using bake(new_data = NULL) gives same result as juice()
  # bake(new_data = NULL)
  juice() %>%
  count(Class) # the two classes should have equal counts
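The point that the resampled data is used for building the model but not for evaluation can be checked directly. The sketch below assumes the sampling steps keep their default skip = TRUE, so they apply when the prepped training data is extracted but are skipped when the recipe is baked onto other data. It uses the raw lending_club data and the hypothetical names sampling_prep, balanced_counts, and untouched_counts so that it runs on its own.

```r
library(recipes)
library(themis)
library(modeldata)
library(dplyr)

data("lending_club")

set.seed(456)
# Only the two sampling steps, to isolate their effect
sampling_prep <- recipe(Class ~ ., data = lending_club) %>%
  step_upsample(Class, over_ratio = 0.5) %>%
  step_downsample(Class, under_ratio = 1) %>%
  prep()

# Data the recipe was prepped on: sampling steps are applied
balanced_counts <- bake(sampling_prep, new_data = NULL) %>%
  count(Class)

# Any data passed as new_data: sampling steps are skipped
untouched_counts <- bake(sampling_prep, new_data = lending_club) %>%
  count(Class)

balanced_counts
untouched_counts
```

This is why metrics computed on the test set still reflect the real class imbalance even though the model was fit on balanced data.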